Blogscape
What are the hottest topics on the web right now? How much does your web presence change day-to-day, or in response to an advertising campaign? How many links did your site receive from the blogosphere this week?
Blogscape is a data source built to answer these questions and more. It’s an “Information Feed Aggregator” and has been monitoring about 10 million feeds since December 2007. These feeds come from any website that offers syndication, but most come from blogs and news sites.
While Linkscape crawls the web-at-large, Blogscape is focused on the ‘fast-moving-web’. It stores and makes searchable the full content of syndication feeds (including link data!), and the newest data is made available several times every day.
Data from Blogscape will appear in various forms in upcoming SEOmoz products, but we’ve decided to release a version of our internal testing tool for it in…
SEOmoz Labs!
SEOmoz Labs is a place where our more adventurous PRO users can check out the “bleeding edge” of SEOmoz technology and product design. From now on, Labs will be our showcase for new ideas and features, ranging from simple proof-of-concept tools to early releases of upcoming products. We’re hoping Labs will be a fun and exciting “sneak peek” for our users, and a source of early feedback for our product development team.
There are some caveats associated with Labs: because these projects are works in progress, data from them could be inaccurate, and the projects themselves could become unavailable temporarily or permanently at any time. Beyond this, we are explicitly not offering user support for Labs projects, so as not to take any time away from our “official projects”.
However, that doesn’t mean we don’t want to hear from you! In the upper right-hand corner of each Labs project page, there is a bright green button labeled “send feedback” (which is a link to [email protected]). Even though we won’t always be able to respond, we’d love to receive your comments, compliments, complaints, and bug reports. Feedback will be incorporated into our product design process, and Labs projects with a significant amount of positive response may find their way into formal projects sooner than others.
If you’re PRO and want to press on immediately, click here to check out SEOmoz Labs now! Otherwise, read on below for more details about Blogscape.
Queries and Graphs
The main Labs page for Blogscape has a box for queries, and a graph below to show how many posts match each query over time (the last 30 days, by default).
For example, the query
“Sean Penn”, “Brad Pitt”, “Richard Jenkins”, “Frank Langella”, “Mickey Rourke”
displays the number of mentions of each Academy Award nominee for “Performance by an actor in a leading role”. The (somewhat cramped) snapshot below shows the results for this query (taken on Feb 25th, 2009; you can view the live version by clicking here).
So, this graph shows the number of posts which mention each actor’s name for every day in February. The spike in mentions for all queries on Feb 23rd corresponds with the actual date of the Academy Awards. As I’m sure you know, Sean Penn took the Oscar for this one, and this is clearly reflected in the fact that he received twice as many hits as any other query on that day.
Viewing Posts
Blogscape stores the snippets of text found in each feed; you can click any data point to view this information. For example, if you’d clicked on the line for Sean Penn on Feb 23rd, you’d see something like this:
This view shows snippets from each post satisfying that query on that day. Posts are ordered by Blogrank, an internally calculated ranking metric. (Any feed can ‘vote’ for any other feed by linking to the website the feed comes from. Feeds with more votes have higher Blogrank.)
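The voting idea can be illustrated with a small Python sketch. This is not the actual Blogrank calculation, which isn’t described in this post; it assumes the simplest possible reading (one vote per linking feed per target site, self-links ignored), and the function name and input layout are made up for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def count_votes(feed_links):
    """Count link 'votes' per website.

    feed_links maps the host a feed comes from to the URLs its posts link to
    (a hypothetical input layout). Each feed contributes at most one vote per
    target host, and links back to its own host are ignored.
    """
    votes = Counter()
    for source_host, urls in feed_links.items():
        targets = {urlparse(url).netloc.lower() for url in urls}
        targets.discard("")                   # relative or malformed URLs
        targets.discard(source_host.lower())  # no voting for yourself
        votes.update(targets)
    return votes
```

Under this assumption, a site linked to by two different feeds ends up with two votes, no matter how many individual posts carried those links.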
Each post has the following information:
• The original title of the post (clicking here takes you to the actual post)
• A snippet of the description of the post
• The title of the feed the post came from (clicking here takes you to the main page of the feed’s source)
• The feed’s Blogrank
• The URL for the feed itself
Advanced Queries
Beyond single terms or phrases in quotes, there are advanced query operators available. For example, you could search for posts containing the word “oscar” or “oscars” with the query
oscar | oscars (open this query)
There are also query operators for finding posts which link to specific URLs, root domains, or subdomains. For example, you could search for posts which link to any URL at the root domain “oscar.com” with the query
rd:oscar.com (open this query)
A list of all available query operators can be found at the Blogscape help page.
Finally, each graph has the option of being weighted by Blogrank (see the checkbox on the right of the Labs page). This makes the graph more of a measure of the “popularity” of a query for any given day, instead of the raw number of matches for it. (Feeds with high Blogrank have many incoming links from other feeds, and tend to come from sources which are viewed by lots of people.)
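To make the difference between the two graph modes concrete, here is a minimal Python sketch. It assumes that “weighted by Blogrank” means summing the Blogrank of each matching post’s feed instead of counting posts; the real weighting formula isn’t given here, and the function name and input format are hypothetical.

```python
from collections import defaultdict

def daily_series(posts, weighted=False):
    """Build one graph line from the posts matching a query.

    posts is an iterable of (day, blogrank) pairs. Unweighted, every post
    counts as 1; weighted, every post counts as its feed's Blogrank
    (an assumed interpretation of the checkbox).
    """
    series = defaultdict(float)
    for day, blogrank in posts:
        series[day] += blogrank if weighted else 1
    return dict(series)
```

With this reading, a single post from a high-Blogrank feed can move the weighted line more than several posts from obscure feeds, which is why the weighted graph behaves more like a popularity measure.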
Data Duplication
You may notice a message at the bottom of the “Posts” view stating that “Posts very similar to these have been filtered from this list.” We’ve worked hard to battle data duplication in Blogscape by carefully canonicalizing feeds (many sites have several URLs for the same data) and posts within a feed. Nonetheless, there are situations where duplicate data is almost impossible to eliminate in advance (for example, some large sites have many feeds with content that occasionally overlaps).
To battle this problem, Blogscape does additional filtering of posts at query time. This filtering ensures that you see only the most relevant version of a post that occurs multiple times in Blogscape’s data stores. For this reason, some queries will show higher post counts on the frequency graphs than in the Posts view itself. If you really want to view every post Blogscape has, you can click on the link at the bottom of the page to turn this feature off.
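As a rough illustration of query-time filtering, the Python sketch below collapses posts whose title and snippet match after normalization and keeps the copy with the highest Blogrank. Both choices are assumptions made for the example: Blogscape’s real similarity test and its definition of “most relevant” aren’t spelled out in this post, and the field names are hypothetical.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def filter_duplicates(posts):
    """Keep one post per (title, snippet) after normalization.

    posts is a list of dicts with 'title', 'snippet' and 'blogrank' keys.
    Among duplicates, the post with the highest Blogrank wins (an assumed
    stand-in for 'most relevant'). Results are ordered by Blogrank.
    """
    best = {}
    for post in posts:
        key = (normalize(post["title"]), normalize(post["snippet"]))
        if key not in best or post["blogrank"] > best[key]["blogrank"]:
            best[key] = post
    return sorted(best.values(), key=lambda p: p["blogrank"], reverse=True)
```

Because the graphs count every stored post while a filter like this hides repeats, the two numbers can legitimately disagree.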
Data Quality
As I mentioned before, Blogscape has been monitoring a sizable portion of the blogosphere for over a year. Nonetheless, we are striving to improve the quality of data within Blogscape, and we’re very excited about two major upcoming improvements:
1. Monitoring of more high-quality feeds
We’ve added Feed Auto-Discovery logic to our processing of Linkscape crawl data, and will be using the results to make sure Blogscape always monitors the most important blogs from across the web.
2. Crawling of source pages
Based on our research, about half of syndication feeds don’t publish the entire content of their posts in their feeds; instead, they publish a truncated section of their content (or occasionally a hand-written summary of it). Most sites that do this also strip HTML from their feeds.
It’s important for data quality to ensure that queries for a term return all posts mentioning that term, and it’s important for SEO that all link information is present. For these reasons, we’re adding functionality to Blogscape that will follow links from syndication feeds and store the actual source content for future search. (Of course, the upcoming crawler will politely ignore sites which block it using robots.txt; details on this will be released when the crawler goes live.)
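Both improvements rest on widely used web conventions, which can be sketched with the Python standard library. The sketch assumes auto-discovery means reading `<link rel="alternate">` tags that advertise an RSS or Atom feed, and that the robots.txt check behaves like Python’s `urllib.robotparser`. The helper names and the `example-bot` user agent are invented for illustration; the real crawler’s user agent hasn’t been announced.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        if "alternate" in rels and mime in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))

def discover_feeds(html, base_url):
    """Return the feed URLs a crawled page advertises."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds

def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a site's robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler built along these lines would call `allowed_by_robots` before fetching any source page linked from a feed, and skip the page when the answer is no.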
Movers and Shakers
Finally, an interesting use of the mountains of data stored by Blogscape is the search for hot trends, or “Movers and Shakers”. You can see the results of this process for several categories by clicking on the “Movers and Shakers” links in the upper right-hand corner of the Blogscape Labs page.
They tend to be most interesting (and stable) in weekly increments; you can view the top “mover and shaker” phrases for this week here. On the day of writing this post, the top phrase is “Safari 4 beta,” which rose 26,632.4% this week (percent change over rank-weighted graphs). Right behind it is “Gary Locke,” which rose 25,101.7% over last week. On the Labs page for this feature, you can click through and view the graph for each individual “mover and shaker”.
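The percent-change figures follow the usual formula, (this week − last week) / last week × 100, applied to the rank-weighted totals. Here is a small Python sketch of that ranking. How Blogscape picks candidate phrases, and how it treats phrases with no history, isn’t described in this post, so the sketch simply skips phrases whose previous total is zero; the function names are hypothetical.

```python
def percent_change(previous, current):
    """Percent change from previous to current, or None if undefined."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0

def top_movers(last_week, this_week, limit=10):
    """Rank phrases by week-over-week percent change, largest rise first.

    Both arguments map phrase -> rank-weighted total for the week.
    Phrases with no total last week are skipped (an assumption).
    """
    changes = []
    for phrase, current in this_week.items():
        change = percent_change(last_week.get(phrase, 0), current)
        if change is not None:
            changes.append((phrase, change))
    return sorted(changes, key=lambda item: item[1], reverse=True)[:limit]
```

A rise of more than 26,000% means the weighted total grew to over 260 times its value the week before, which is what you’d expect for a phrase that was barely mentioned until a launch or announcement.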
Conclusion
We’re excited to launch this feature, and even more excited about the data quality improvements we’ll be making on it in the next few months. If you’re PRO, check it out, and send your comments our way!